ResSAT: Enhancing Spatial Transcriptomics Prediction from H&E-Stained Histology Images with Interactive Spot Transformer

Spatial transcriptomics (ST) revolutionizes RNA quantification with high spatial resolution. Hematoxylin and eosin (H&E) images, the gold standard in medical diagnosis, offer insights into tissue structure, correlating with gene expression patterns. Current methods for predicting spatial gene expression from H&E images often overlook spatial relationships. We introduce ResSAT (Residual networks - Self-Attention Transformer), a framework generating spatially resolved gene expression profiles from H&E images by capturing tissue structures and using a self-attention transformer to enhance prediction. Benchmarking on 10x Visium datasets, ResSAT significantly outperformed existing methods, promising reduced ST profiling costs and rapid acquisition of numerous profiles.


Background
The rapid advancement of spatial transcriptomics (ST) technology has revolutionized the field of RNA abundance quantification by offering remarkable spatial resolution for concurrent gene expression profiling and precise spatial spot localization [1]. This breakthrough allows researchers to generate detailed maps of gene expression within tissues, providing unprecedented insights into cellular function and tissue organization [2]. However, the resource-intensive and time-consuming nature of ST profiling currently limits its widespread application, creating a demand for more accessible methods.
In contrast, hematoxylin and eosin (H&E) staining is a widely used histological technique that provides detailed insights into tissue structure and composition at a microscopic level [3,4]. H&E images are instrumental in medical diagnostics, offering clear visualizations of cellular morphology and tissue architecture [5]. As the gold standard in many diagnostic procedures, H&E imaging is cost-effective, readily available, and extensively utilized in clinical settings.
Moreover, there is a close relationship between H&E staining and ST, as H&E images capture detailed cellular and tissue morphology that correlates with gene expression patterns [6]. The staining highlights different cellular components: hematoxylin stains the cell nuclei blue, indicating areas of high DNA concentration, while eosin stains the cytoplasm and extracellular matrix pink, showing the structural context [7]. These visual cues from H&E images reflect the underlying biological processes and molecular activities within the tissue [3,4], providing a spatial map that can be linked to gene expression data.
Previous studies have also demonstrated that changes in gene expression or genetic mutations often influence cell morphology, structure, and distribution, resulting in alterations in histological features [8,9]. This correlation enables the use of H&E images to predict spatial gene expression profiles, leveraging the morphological context provided by the staining to infer molecular states. Recognizing the potential to integrate these two tools, recent studies [10-12] have focused on developing computational approaches to predict ST data from H&E images. By doing so, these approaches can potentially circumvent the limitations of ST technology, making high-resolution gene expression mapping more accessible and practical.
Several existing approaches, such as ST-Net [6], HisToGene [11], and BLEEP [12], have shown promising results in predicting expression from histology images. Both ST-Net and HisToGene treat expression prediction as a regression task and are trained in a feed-forward manner. ST-Net utilizes a ResNet50 image encoder, while HisToGene utilizes a vision transformer backbone. BLEEP, on the other hand, draws inspiration from contrastive language-image pretraining to establish a comparable joint embedding between spot expression profiles and their spatially paired image patches. Although HisToGene incorporates spatial location information, it does not explicitly consider spatial relationships between different locations.
It is worth highlighting that existing methods capable of generating spatially resolved expression predictions either have limitations regarding the predicted gene panel (ST-Net, HisToGene, and BLEEP) or are prone to overfitting (HisToGene). Additionally, existing approaches to spatial expression prediction from H&E images often overlook the relationships between different spatial locations. These relationships provide essential context about tissue architecture, including the organization and interaction of cells, and about spatial heterogeneity, reflecting variations in cellular composition and function across tissue regions. These factors may significantly impact biological interpretation and predictive accuracy [13]. To address these challenges, we proposed a novel approach called ResSAT (Residual networks - Self-Attention Transformer) for predicting spatial transcriptomics profiles from H&E-stained histology images. We utilized a ResNet50 architecture to extract comprehensive image features from the H&E images, enabling our model to capture diverse characteristics of the tissue structures and cellular compositions depicted in the images. Additionally, we introduced a self-attention transformer mechanism to identify and cluster spots with high correlation, allowing the model to focus on interactions between spots and enhance spatial gene expression prediction performance.
We validated the effectiveness of ResSAT by benchmarking its performance on two different mouse brain datasets obtained from the 10x Visium platform. Our results demonstrated significant improvements over existing methods such as BLEEP [12], HisToGene [11], and ST-Net [10] in terms of mean correlations across all genes and mean correlations among the top 50 highly expressed genes. The proposed framework has the potential to substantially reduce the time and cost associated with spatial transcriptomics profiling, opening up new possibilities for acquiring numerous spatial transcriptomics profiles rapidly and reconstructing comprehensive 3D spatial transcriptomics from adjacent 2D spatial transcriptomics profiles.

Results
ResSAT enables gene expression prediction and consistently performs well.
To assess the predictive efficacy of ResSAT in quantifying gene expression, we applied our method to both the sagittal anterior (SA) and sagittal posterior (SP) datasets. We compared ResSAT to three other methods for spatial gene expression prediction: BLEEP, HisToGene, and ST-Net. The predicted expression profiles from ResSAT showed the highest mean correlation with ground truth, achieving an increase in Pearson correlation coefficients (PCCs) for both the mean correlation of all genes (Table 1) and the mean correlation of the top 50 most highly expressed genes (HEGs) (Table 2) in the two datasets. To ensure robustness and reliability, the evaluations were repeated five times, and the resulting mean and standard deviation values were computed. For each of the SA and SP datasets, we trained ResSAT on one section and evaluated the correlation between predicted and actual gene expression on the other section. Specifically, after training on Section 2 and testing on Section 1, as shown in Tables 1 and 2, we also trained on Section 1 and tested on Section 2 to validate the results. As illustrated in Fig. 2, ResSAT consistently yielded the highest PCCs between predicted and actual gene expression across all sections in both datasets.
Examining the effect of each module within ResSAT on the predicted gene expression results.
To better understand why ResSAT performs better than other methods, we analyzed how each component contributes to its performance. We did this by removing individual modules of ResSAT and observing the impact on its ability to predict gene expression, as outlined in Figs. 3 and 4. We found that keeping all modules intact resulted in the strongest correlation between predicted and observed gene expression. Specifically, when we excluded the ResNet module or the SAT module, we observed a decrease in average PCC in both the SA and SP datasets. This indicates that the ResNet module and the SAT module are both crucial for ResSAT to effectively uncover the relationships between different spots. In summary, our ablation experiment highlights the importance of maintaining all modules within ResSAT to achieve optimal performance in predicting gene expression.
We examined whether the predicted gene expression by ResSAT accurately mirrors the actual expression of brain-related genes.
Within the SA dataset, we assessed the correlation between observed and predicted gene expression, computing the PCC for each gene. We then ranked these genes in descending order of their PCCs and selected the five genes with the highest correlations obtained from our method for visualization (CALB2, GNG4, CDHR1, DOC2G, and SHISA8), as shown in Fig. 5. CALB2 (Calretinin) expression in the mouse olfactory bulb is associated with inhibitory interneurons [32]. These interneurons play essential roles in regulating neural circuits involved in processing sensory information related to odors. GNG4 (G Protein Subunit Gamma 4) is a gene encoding a subunit of heterotrimeric G proteins, which are involved in signal transduction pathways in neurons [33]. In the mouse olfactory bulb, G proteins play a crucial role in mediating signaling pathways involved in odorant detection and processing. Heterotrimeric G proteins, consisting of alpha, beta, and gamma subunits, are involved in transducing signals from odorant receptors to downstream effector molecules, leading to neuronal activation and olfactory perception [33]. CDHR1 (Cadherin-Related Family Member 1) is a gene expressed in the olfactory bulb of mice, where it likely contributes to the organization and maintenance of the olfactory sensory epithelium [34,35]. DOC2G, also known as Double C2 Domain Gamma, is a gene encoding a calcium-binding protein involved in vesicle exocytosis and neurotransmitter release. In the mammalian olfactory system, complex information processing starts in the olfactory bulb, whose output is conveyed by mitral cells (MCs) and tufted cells (TCs) [36]. DOC2G was identified as differentially expressed between MCs and TCs of the mouse [36]. SHISA8, a member of the Shisa family of transmembrane proteins, has a broad role in synaptic function and neuronal development, suggesting its potential involvement in olfactory processing [37].
To further validate the robustness of ResSAT in predicting brain-related genes, we extended our analysis to the SP dataset, achieving similarly accurate predictions, as shown in Fig. 6. ResSAT enabled accurate prediction of key genes associated with the mouse brain.

Discussion
To address the challenges inherent in spatial transcriptomics prediction, we devised a novel approach called ResSAT. Leveraging a ResNet50 architecture, we extracted comprehensive image features from the provided H&E images. This enabled our model to capture a wide range of the tissue structures and cellular compositions depicted in the images. We then introduced a self-attention transformer mechanism to cluster spots exhibiting high correlation. This allowed the model to focus on interactions between spots, thereby enhancing spatial gene expression prediction performance.
In our experimental evaluations on benchmark ST datasets, our proposed method demonstrated its effectiveness in predicting spatial gene expression patterns in two different mouse brain datasets. We achieved significantly higher mean correlations across all genes and across the top 50 most highly expressed genes (HEGs), representing substantial improvements over existing methods. Additionally, for the top 5 predicted genes, the expression patterns predicted by ResSAT were similar to the observed expression profiles, with the predicted spatial locations of expression closely matching the observed ones. This underscores the efficacy of our approach in spatial transcriptomics prediction from H&E images, indicating its potential to generate numerous spatial transcriptomics profiles efficiently.
Our focus in this paper primarily centers on mouse brain datasets, laying the groundwork for our subsequent 3D reconstructed map of brain regions in spatial transcriptomics. Despite the successful prediction of specific genes showcased in Figs. 5 and 6, we acknowledge that the overall absolute correlations in Tables 1 and 2 remain low.

Conclusion
We introduced ResSAT, a novel framework for predicting spatial gene expression profiles from H&E-stained histology images using a ResNet50 architecture and a self-attention transformer mechanism. ResSAT effectively captures tissue structures and clusters correlated spots to enhance prediction performance. Our evaluations on mouse brain datasets demonstrated significant improvements over existing methods, with higher mean correlations and accurate spatial predictions.

Despite challenges like low absolute correlations for certain genes and limited tissue sections, ResSAT outperformed current methods, showing potential for efficient and cost-effective ST profiling. Future availability of more training data is expected to further enhance ResSAT's performance and robustness, advancing the field of spatial transcriptomics.

Methods
Datasets: Two mouse brain datasets (SA and SP).
ST profiling of the anterior and posterior parts of mouse brain sagittal tissue sections was generated with the Visium technology from 10x Genomics [14-17]. Both datasets consist of serial H&E histology images and paired gene expression measurements at the spatial spots, together with the spot coordinates. The analyzed gene expression count matrices are outputs of the SpaceRanger pipeline [18].
The anterior sagittal dataset comprises two sets of H&E-stained histology images, alongside corresponding spatial gene expression data [14,15]. In slice 1, the spatial transcriptomics (ST) data contain the expression profiles of 31,053 genes across 2,825 spots, given as read counts. Slice 2 contains ST data for 31,053 genes across 2,696 spots, also given as read counts.
The posterior sagittal dataset includes two sets of H&E-stained histology images and their associated spatial gene expression data [16,17]. For the first slice, the ST data contain the expression levels of 32,285 genes across 3,355 spots, given as read counts. The second slice contains ST data for the same number of genes across 3,289 spots, also given as read counts.

Data preprocessing
For the H&E histology images, we extracted patches based on the size and location of each spot. Each patch was a 224 × 224 pixel image centered on a spot, covering approximately 55 µm on each side, and was paired with the spot's corresponding gene expression profile. For the spatial gene expression profiles of each tissue section, the counts of each spot were normalized to the total count and then log-transformed. The union of the top 1,000 most highly variable genes from each of the slices was used for training and prediction. Finally, the expression data of these samples were batch corrected using Harmony [19] before one of the slices was randomly selected to be held out for testing. For these two datasets, slice 2 was used for training, while slice 1 was held out for testing.
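The expression-side preprocessing can be illustrated with a minimal sketch in plain Python. It is not the authors' pipeline: the scaling target of 10,000 counts per spot, the use of plain variance to rank genes, and the function names `normalize_log1p`, `top_variable_genes`, and `union_of_variable_genes` are all assumptions made for illustration, and Harmony batch correction is omitted.

```python
import math
from statistics import pvariance


def normalize_log1p(counts, target_sum=10_000.0):
    """Scale each spot (row) to a common total, then apply log(1 + x).

    target_sum is an assumed scaling constant; the text only states that
    spots are normalized to the total count and log-normalized.
    """
    normalized = []
    for spot in counts:
        total = sum(spot) or 1.0  # guard against spots with zero counts
        normalized.append([math.log1p(c / total * target_sum) for c in spot])
    return normalized


def top_variable_genes(expr, gene_names, n_top):
    """Names of the n_top genes with the largest variance across spots.

    Plain variance is a simplification of highly-variable-gene selection.
    """
    variances = [pvariance(column) for column in zip(*expr)]
    ranked = sorted(range(len(gene_names)), key=lambda g: variances[g], reverse=True)
    return {gene_names[g] for g in ranked[:n_top]}


def union_of_variable_genes(slices, gene_names, n_top=1000):
    """Union of the per-slice variable gene sets, kept in original gene order."""
    selected = set()
    for expr in slices:
        selected |= top_variable_genes(expr, gene_names, n_top)
    return [g for g in gene_names if g in selected]
```

Taking the union rather than the intersection means a gene that is variable in only one slice is still kept, so the final panel can contain more than 1,000 genes.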
Learning image embedding for expression prediction.
Residual networks (ResNets) [20] are widely recognized architectures in image classification, initially acclaimed for their significant advancement upon introduction. They continue to be a benchmark in various image analyses [21-23] and serve as baselines in image studies introducing novel architectures [24,25]. Our focus in this paper was to extract highly representative features from H&E images using ResNet50. During the training phase, we utilized a dataset comprising pairs of H&E images and their corresponding gene expressions. Each image patch was represented as a 3D tensor $X \in \mathbb{R}^{W \times H \times 3}$, where $W$ and $H$ denoted the width and height of the patch, respectively. The associated gene expression was represented as a $G$-dimensional vector $y$ in $\mathbb{R}^{G}$.
Our objective was to develop a deep learning framework capable of accurately predicting gene expression from a given image patch. We conceptualized our model as comprising two main components: a backbone network, denoted as $f$, serving as the initial feature extraction module, followed by a feature refinement module, $g$, to enhance the predictive capability of the extracted features. To optimize the performance of both modules, we employed the Mean Squared Error (MSE) loss function during the training process as follows: $\mathcal{L}_{1} = \lVert \hat{y} - y \rVert_{2}^{2}$, where $\lVert \cdot \rVert_{2}$ denotes the L2 norm used to measure the difference between the predicted gene expression $\hat{y}$ and the ground-truth gene expression $y$.
The emergence of transformers has led to their widespread application in image and omics data analysis, as demonstrated by numerous studies [26-29]. In particular, recent transformer-based architectures have drawn attention to self-attention and cross-attention mechanisms, which offer a means to capture interdependencies between different input modalities [30,31].
In the context of our spatially resolved gene expression prediction approach, the Self-Attention Transformer (SAT) plays a crucial role in exploring spot-spot interactions.By clustering spots that exhibit high correlation, SAT enhances our model's ability to represent gene expression accurately.This approach facilitates a deeper understanding of spatial relationships within the data, contributing to improved predictive performance.
To analyze H&E images for spot interaction, we proposed a spot-interaction module, as shown in Fig. 1. This module aimed to refine predictions of gene expression. To streamline the model and minimize the number of parameters, we introduced a non-parametric attentive module. The operation of this attention module was defined by the following equation: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$, where $K$, $Q$, and $V$ represent the key, query, and value matrices of a Transformer module, and $d$ is the feature dimension. To model the spot interaction, we aimed to learn a compact gene expression correlation between the spots in the training set. Here, $F$ represented the input features, and $\phi$ and $\psi$ denoted transformation functions. The aim of this formulation was to explore the matrix of spot-spot interactions. By doing so, it clustered spots that exhibit high correlation, thereby enhancing the robustness of the gene expression representation. Moreover, this approach elucidated the correlation among spots, improving the model's inference capabilities. To enhance the model's learning process by integrating knowledge of ground-truth gene expressions, we introduced a secondary MSE loss function. This additional MSE loss was formulated as follows: $\mathcal{L}_{2} = \lVert \tilde{y} - y \rVert_{2}^{2}$, where $\tilde{y}$ denotes the gene expression predicted after the spot-interaction module and $y$ the ground-truth gene expression.
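The idea of a non-parametric attention step over spots can be sketched as follows. This is an illustration under stated assumptions rather than the ResSAT implementation: queries, keys, and values are all taken to be the spot features themselves (no learned projections), the scores are scaled by the square root of the feature dimension as in standard Transformer attention, and the function names `softmax` and `spot_self_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spot_self_attention(features):
    """Refine per-spot features with parameter-free self-attention.

    features: list of spots, each a list of d feature values.
    Assumption: Q = K = V = features, so no weights are learned.
    Returns (refined_features, attention_weights), where
    attention_weights[i][j] is how strongly spot i attends to spot j.
    """
    d = len(features[0])
    scale = math.sqrt(d)
    weights = []
    for query in features:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in features]
        weights.append(softmax(scores))
    refined = [
        [sum(w * value[j] for w, value in zip(row, features)) for j in range(d)]
        for row in weights
    ]
    return refined, weights
```

Each row of the returned weight matrix sums to one, so every refined feature vector is a weighted average of all spot features in which similar spots receive larger weights.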
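Integrating dual objectives into a unified framework, we defined the overall loss function as follows: $\mathcal{L} = \mathcal{L}_{1} + \mathcal{L}_{2}$, where $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ are the primary and secondary MSE losses, respectively.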
This combined loss framework was designed to enhance the model's predictive performance by balancing feature extraction and interactive feature refinement. By summing the contributions of the primary and secondary MSE losses, the model was steered to attend to the subtleties and complexities of gene expression data. This, in turn, was expected to improve the model's capacity for capturing the intricate biological relationships represented within H&E images.
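As a worked illustration of summing the two objectives, the sketch below computes an MSE for two sets of predictions and adds them with equal weight. Which prediction each loss is attached to is an assumption here (the primary loss on the prediction before spot interaction, the secondary loss on the prediction after it), and `mse` and `combined_loss` are hypothetical names.

```python
def mse(predicted, target):
    """Mean squared error averaged over all spots and genes."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def combined_loss(primary_pred, refined_pred, target):
    """Unweighted sum of the primary and secondary MSE losses.

    Assumption: primary_pred comes from the feature extraction path and
    refined_pred from the spot-interaction path.
    """
    return mse(primary_pred, target) + mse(refined_pred, target)
```

For example, with a target of `[[1.0, 2.0]]`, a perfect primary prediction and a refined prediction of `[[2.0, 4.0]]` give a combined loss of 0 + (1 + 4) / 2 = 2.5.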

Evaluated Metrics
In this study, we employed the Pearson correlation coefficient (PCC) to measure the agreement between the spatial gene expression predicted by ResSAT and the observed gene expression. The PCC ranges between −1 and 1. It is determined by dividing the covariance of two variables by the product of their individual standard deviations: $\mathrm{PCC} = \frac{\mathrm{Cov}(Y, \hat{Y})}{\sigma_{Y}\,\sigma_{\hat{Y}}}$, where $\mathrm{Cov}$ denotes covariance; $Y$ and $\hat{Y}$ represent the original gene expression and the predicted gene expression, respectively; and $\sigma$ denotes standard deviation.
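We computed the mean correlation of all genes as follows: $\overline{\mathrm{PCC}} = \frac{1}{G}\sum_{i=1}^{G} \mathrm{PCC}_{i}$, where $\mathrm{PCC}_{i}$ represents the Pearson correlation coefficient between the true and predicted expression values of the $i$-th gene, $Y_{i}$ and $\hat{Y}_{i}$, respectively, and $G$ is the number of all genes.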
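These metrics can be sketched in a few lines of plain Python. The sketch makes assumptions that the text does not fix: expression matrices are laid out as spots by genes, a gene with zero variance is assigned a correlation of 0, and the optional `top_k` argument ranks genes by mean observed expression to restrict the average to the most highly expressed genes. The names `pearson`, `per_gene_pcc`, and `mean_pcc` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: constant genes count as uncorrelated
    return cov / (sd_x * sd_y)


def per_gene_pcc(observed, predicted):
    """PCC of each gene (column) across spots (rows)."""
    return [pearson(o, p) for o, p in zip(zip(*observed), zip(*predicted))]


def mean_pcc(observed, predicted, top_k=None):
    """Mean PCC over all genes, or over the top_k most highly expressed genes.

    Assumption: "highly expressed" is ranked by mean observed expression.
    """
    pccs = per_gene_pcc(observed, predicted)
    genes = list(range(len(pccs)))
    if top_k is not None:
        mean_expr = [sum(col) / len(col) for col in zip(*observed)]
        genes = sorted(genes, key=lambda g: mean_expr[g], reverse=True)[:top_k]
    return sum(pccs[g] for g in genes) / len(genes)
```

Calling `mean_pcc(observed, predicted)` gives the all-gene average, and `mean_pcc(observed, predicted, top_k=50)` gives the average over the 50 most highly expressed genes.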
In addition to the mean correlation of all genes, we computed the mean correlation of the top 50 most highly expressed genes (HEGs) in predicted gene expression, compared with the observed gene expression. This metric provided insight into the performance of the prediction method specifically for genes that are highly expressed in the spatial context. Technically, the mean correlation of the top HEGs was calculated as follows: $\overline{\mathrm{PCC}}_{\mathrm{HEG}} = \frac{1}{50}\sum_{i \in \mathrm{HEG}} \mathrm{PCC}_{i}$, where $\mathrm{HEG}$ denotes the set of the 50 most highly expressed genes and $\mathrm{PCC}_{i}$ is the Pearson correlation coefficient of the $i$-th gene.

Comparative evaluation of spatial transcriptome prediction methods. SA_Section 1: trained on Section 2 of the SA dataset and tested on Section 1. SA_Section 2: trained on Section 1 of the SA dataset and tested on Section 2. SP_Section 1: trained on Section 2 of the SP dataset and tested on Section 1. SP_Section 2: trained on Section 1 of the SP dataset and tested on Section 2.
Ablation experiments on the SA dataset.
Ablation experiments on the SP dataset.

Table 1
Figure 6 shows the top 5 genes (NRGN, CTXN1, PCP2, NNAT, and CAMK2A) in the SP dataset predicted by ResSAT. NRGN (Neurogranin) is a gene encoding a protein found primarily in the brain, specifically in dendritic spines of neurons. It is involved in regulating synaptic plasticity and learning processes by modulating the function of calmodulin, a calcium-binding protein. NRGN has been implicated in various neurological disorders, including Alzheimer's disease and schizophrenia [38]. CTXN1 (Cortexin-1) is a gene encoding a protein involved in neuronal development and synaptic function, and is especially highly expressed in the cerebral cortex [39]. PCP2 (Purkinje Cell Protein 2) is a gene expressed primarily in Purkinje cells of the cerebellum [40]. It encodes a protein involved in dendritic development, synaptic transmission, and calcium signaling within Purkinje cells. Mutations in PCP2 have been linked to certain neurodevelopmental disorders. NNAT (Neuronatin) is a gene encoding a protein expressed in the brain, particularly in neurons, where it regulates neuronal development and function. It is involved in processes such as neuronal differentiation, synaptogenesis, and neurotransmitter release [41]. NNAT has been implicated in neurological disorders and metabolic regulation [42]. CAMK2A (Calcium/Calmodulin-Dependent Protein Kinase II Alpha) is a gene encoding a protein kinase involved in calcium signaling and synaptic plasticity. It plays a crucial role in neuronal excitability, synaptic transmission, and learning and memory processes. Dysregulation of CAMK2A has been implicated in various neurological disorders, including Alzheimer's disease and epilepsy [43]. These genes are known for their critical roles in neuronal function and brain development, reinforcing the model's capability across different brain regions and datasets. This consistency underscores ResSAT's reliability in predicting key genetic markers relevant to mouse brains.
The low absolute correlations in Tables 1 and 2 indicate the challenging nature of the prediction task for the majority of genes. The low scores may stem from various factors, including the weak correlation between the expression of certain genes and morphological features, limitations in the detection of certain genes by the Visium platform resulting in less predictable expression, and the presence of experimental artifacts introducing non-biological variability into the data, independent of the image. Due to the relatively small number of tissue sections available, neither ResSAT nor other existing methods can reliably predict gene expression with high accuracy. However, ResSAT still demonstrates superior prediction accuracy compared to other methods. While the reliance on a relatively large training set poses a potential limitation for deep learning-based models, we anticipate that as more training ST data become available in the near future, ResSAT's performance and robustness can be further enhanced.